## library(devtools)
## library(BiocManager)
library(SingleCellExperiment)
library(ggplot2)
library(scFeatures) ## devtools::install_github("SydneyBioX/scFeatures")
library(ClassifyR) ## BiocManager::install("ClassifyR", dependencies = TRUE)
library(ggthemes)
library(spicyR) ## BiocManager::install("spicyR")
library(dplyr)
library(limma)
library(plotly)
library(scattermore)
library(tidyr)
library(survival)
library(survminer)
library(spatstat)
library(lisaClust)
##library(spatstat.core) ## install.packages("spatstat")
##library(spatstat.geom) 
library(scater)
library(scran)

theme_set(theme_classic())

1 Overview

As single cell technology advances, the recent development of spatial omics allows us to examine the spatial organisation of cells within tissues in their native environment. This workshop will discuss the challenges and analytical focus associated with disease outcome prediction using multi-sample spatial datasets. We will also talk about general analytical strategies and the critical thinking questions that arise in the workflow.


Preparation and assumed knowledge

  • Knowledge of R syntax
  • Familiarity with the SingleCellExperiment class
  • Ability to install all required R packages, please check sessionInfo at the end of this document to ensure you are using the correct version.

Learning objectives

  • Understand and visualise spatial omics datasets
  • Explore various strategies for disease outcome prediction using spatial omics data
  • Understand and generate individual feature representations from a cell-level expression matrix
  • Develop appreciation on how to assess the performance of classification models
  • Perform disease outcome prediction using the feature representation and robust classification framework


Time outline

Structure for this 2-hour workshop:

Activity Time
Introduction to spatial technologies 20m
Cell segmentation with deep learning (with BIDCell) 20m
Exploring spatial data 20m
Break (Q&A) 10m
Extracting features from spatial data (with scFeatures) 30m
Performing disease outcome classification (with ClassifyR) 20m

2 Initial exploration and visualisation

Data and background

Imaging mass cytometry (IMC) is a new imaging technique that gathers spatial information to create images that show the distribution of different cell types and their associated protein expression patterns in tissue. In this demo, we examine IMC dataset taken from:

Moldoveanu D et. al. Spatially mapping the immune landscape of melanoma using imaging mass cytometry. Science Immunology, Apr;7(70):eabi5072. doi: 10.1126/sciimmunol.abi5072. PMID: 35363543.

Here, the authors quantified the expression of 35 protein markers in two melanoma cohorts. Basic characteristics of the data objects:

  • The dataset contains 35 proteins and 112,497 cells.
  • The outcome is 16 non-responders (Response = No) and 14 responders (Response = Yes).
  • 17 patients received CTLA4 treatment, 8 patients received PD1, 5 patients received both.
  • From a range of tissue sources and primary sites.
data_sce <- readRDS("data_sce.rds")
data_sce <- logNormCounts(data_sce)

print("data format")
## [1] "data format"
data_sce
## class: SingleCellExperiment 
## dim: 35 112497 
## metadata(0):
## assays(2): counts logcounts
## rownames(35): SMA PDL1 ... ICOS CD29
## rowData names(0):
## colnames(112497): 26BL:1 26BL:2 ... 21RD:5410 21RD:5411
## colData names(33): orig.ident nCount_RNA ... ident sizeFactor
## reducedDimNames(2): PCA UMAP
## mainExpName: RNA
## altExpNames(0):
print("expression matrix is stored in proteins by cells matrix")
## [1] "expression matrix is stored in proteins by cells matrix"
logcounts(data_sce)[1:7, 1:7]
## 7 x 7 sparse Matrix of class "dgCMatrix"
##          26BL:1   26BL:2     26BL:3     26BL:4     26BL:5     26BL:6     26BL:7
## SMA   1.4284103 1.641814 1.41206854 1.88787898 0.56130933 0.04550849 1.64393850
## PDL1  1.2555909 1.181598 1.28170915 1.54414352 0.07956439 0.44187590 1.15627466
## OX40  .         .        .          .          .          .          .         
## CD45  .         .        0.02253247 0.02522008 0.97111591 .          .         
## LAG3  .         .        .          .          .          .          .         
## TIM3  0.8617437 0.854122 0.97635985 0.91544402 1.03250567 1.16430750 0.92958376
## FoxP3 .         .        .          .          .          .          0.05413394
##print("the object stores meta data (such as patient outcome information) about each cell")
##DT::datatable( data.frame(colData(data_sce))[1:5, ]  , options = list(scrollX = TRUE))

Aim

In this demo, we will examine one of the cohorts - pretreatment melanoma samples from 30 individuals with advanced disease who subsequently received ICI therapy. We will aim to identify features that has the potnetial to discriminate between non-responders (Response = No) and responders (Response = Yes).

Exploration 1: How complex is my data?

At the start of the exploration, it is often good to get a sense of the complexity of the data. We usually do this by exploring the distribution of the outcomes and various variables in the patient’s meta-data. Here, we use cross-tabulation to examine the following variables:

  • responder vs non-responders
  • tissue source
  • primary site
print("number of responder and non responder in each type of treatment  ")
## [1] "number of responder and non responder in each type of treatment  "
metadata <- colData(data_sce)
metadata <- metadata[ !duplicated(metadata$Sample_ID), ]
table(metadata$Response, metadata$Treatment) 
##      
##       both CTLA4 PD1
##   No     1    11   4
##   Yes    4     6   4
print("Number of patients based on tissue source")
## [1] "Number of patients based on tissue source"
table(metadata$Tissue_Source)
## 
##         acral         brain            LN     other met skin_nonacral 
##             3             4             9             5             9
print("Number of patients based on primary site")
## [1] "Number of patients based on primary site"
table(metadata$Primary_site)
## 
##                                    5th left toe                 abdomen 
##                       2                       1                       2 
##                    back                    calf              chest wall 
##                       5                       1                       1 
##                     ear                   l arm left arm and left elbow 
##                       1                       1                       1 
##             left eyelid          left foot sole           left shoulder 
##                       1                       1                       1 
##              left thumb                    neck         right upper arm 
##                       1                       1                       1 
##                   scalp                 shouler              thigh skin 
##                       1                       1                       2 
##                     toe                   trunk                 unknown 
##                       1                       1                       2 
##                   vulva 
##                       1
print("Cross tabulation")
## [1] "Cross tabulation"
table(metadata$Response, metadata$Tissue_Source)
##      
##       acral brain LN other met skin_nonacral
##   No      1     3  5         3             4
##   Yes     2     1  4         2             5

Exploration 2: How to visualise my data?

Typically in single-cell data analysis, we perform dimension reduction to project the high dimensional cell by gene matrix on to 2D space. This allows us to visualise various things of interest, such as distribution of cell types and disease outcomes. In this dataset, cells were classified in the following cell types based on their markers:

  • CD8+ cytotoxic T cells (Tc).
  • CD4+ T helper cells (Th).
  • macrophage and monocytes (macro.mono; CD68+ or CD14+).
  • B cells (B; CD20+), melanoma and melanocytes (melano; SOX10+ or S100+).
  • endothelial cells (CD31+).
  • cells that did not express any of these markers or with conflicting expression patterns were classified as “others”.

Note: for single-cell RNA-seq with around 20,000 genes, we often need to perform some filtering (e.g. select highly variable genes) to reduce the number of features. Here, given we have less than 50 proteins, there is no need to pre-filter. That being said, we provide some sample code below (commented out) to demonstrate how to identify highly variable genes followed by UMAP visualisation in scRNA-seq data.

# gene_var <- modelGeneVar(data_sce)
# hvgs <- getTopHVGs(gene_var, prop=0.1)
# data_sce <- runUMAP(data_sce, scale=TRUE,  subset_row = hvgs)
data_sce <- runUMAP(data_sce, scale=TRUE)
a <- plotUMAP(data_sce, colour_by = "Cluster.v2")
b <- plotUMAP(data_sce, colour_by = "Response")
c <- plotUMAP(data_sce, colour_by = "Sample_ID")
a + b + c

Interactive Q&A:

What can we learn from these illustrations? Is there anything interesting in the plot? Questions to consider include:

  • Q1: Is there patient batch effect?
  • Q2: Are the responder and non-responder patients easy or difficult to distinguish?
print("Optional")
## [1] "Optional"
metadata <- colData(data_sce)
metadata <- cbind(metadata, reducedDim(data_sce, "UMAP"))
metadata <- data.frame(metadata)

plotlist <- list()
thispatient  <-  unique(metadata$Sample_ID)[1]
for ( thispatient in unique(metadata$Sample_ID)){
        metadata$selected_patient <- ifelse( metadata$Sample_ID == thispatient, "seleted patient" , "other patients")
        
       p <- ggplot(metadata, aes(x =UMAP1 , y = UMAP2 , colour = selected_patient  )) + geom_scattermore(pointsize = 0.5) + ggtitle(thispatient) + scale_colour_manual(values = c("grey" , "red"))
         
       plotlist [[thispatient]] <- p
}

ggarrange(plotlist = plotlist , ncol = 5, nrow = 6 , common.legend = T )

Exploration 3: Is there a spatial structure in my data?

The advantage with spatial omics is that we can examine the organisation of the cell types as it occurs on the tissue slide. Here, we visualise one of the slides from a patient. As an optional exercise, you can

  • permute the cell type label
  • permute the spatial coordinates

to give a sense of what is random ordering.

one_sample <- data_sce[, data_sce$Sample_ID  == unique(data_sce$Sample_ID)[1]]
one_sample  <- colData(one_sample)
one_sample <- data.frame(one_sample)

tableau_palette <- scale_colour_tableau() 
color_codes <- tableau_palette$palette(10)

a <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour =Cluster.v2)) + geom_point(alpha=0.7) +  scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Original slide")
print("Optional: Permute the cell type labels")
## [1] "Optional: Permute the cell type labels"
one_sample$Cluster.v2_permute <- sample(one_sample$Cluster.v2)
b <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour =Cluster.v2_permute)) + geom_point(alpha=0.7) +  scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the cell type label")

print("Optional: Permute the spatial coordinate")
## [1] "Optional: Permute the spatial coordinate"
one_sample$Location_Center_X_permute <- sample(one_sample$Location_Center_X)
one_sample$Location_Center_Y_permute <- sample(one_sample$Location_Center_Y)
c <- ggplot(one_sample, aes(x = Location_Center_X_permute , y = Location_Center_Y_permute, colour =Cluster.v2)) + geom_point(alpha=0.7) +  scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the X, Y coordinate")
a + b + c

Critical thinking:

  • Is there structure in the data real ?
  • What are additional strategies to generate a random distribution?

3 Describing tissue microenvrionments and cellular neighbourhoods

3.1 Do cell type co-localise in specfic regions?

We begin by examining how can we identify and visualise regions of tissue where spatial associations between cell-types are similar. There are many packages that perform this task andhere we use the lisaClust function that is based on “local L-function” to spatially cluster cells into different regions with similar cell type composition.

set.seed(51773)
BPPARAM <- simpleSeg:::generateBPParam(2)

# Cluster cells into spatial regions with similar composition.
data_sce  <- lisaClust(
  data_sce ,
  k = 5,
  Rs = c(20, 50, 100),
  sigma = 50,
  spatialCoords = c("Location_Center_X", "Location_Center_Y"),
  cellType = "Cluster.v2",
  imageID = "Sample_ID" ,
  regionName = "region",
  BPPARAM = BPPARAM
)

3.2 Which regions appear to be different between responders and non-responders?

Visualise regions on individual level

metadata <- colData(data_sce)
metadata <- metadata[ metadata$Sample_ID == metadata$Sample_ID[1],  ]
metadata <- data.frame(metadata)

plotlist <- list()
plotlist_celltype <- list()
thisregion  <-  unique(metadata$region)[1]

tableau_palette <- scale_colour_tableau() 
color_codes <- tableau_palette$palette( 10 )
color_codes <- c(color_codes, "darkgrey" , "grey90") 

names(color_codes) <- c( unique(metadata$Cluster.v2) ,  "other regions")

for ( thisregion in sort(unique(metadata$region))){
        
        selected_region_index <-  metadata$region == thisregion
        
        metadata$selected_region <-  "other regions"
        metadata$selected_region[selected_region_index] <- "selected region"
        
        metadata$celltype <- metadata$Cluster.v2
        metadata$celltype[!selected_region_index ] <-   "other regions"
        
        metadata$celltype <- factor(metadata$celltype, levels = c(unique(metadata$Cluster.v2), "other regions"))

       p <- ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour = selected_region  )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values = c("grey" , "red"))
         
       
    
       p2 <-  ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour =  celltype )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values =  color_codes)
       
      plotlist [[thisregion ]] <- p
       
      plotlist_celltype [[thisregion ]] <- p2
}

ggarrange(plotlist = plotlist , ncol = 5, nrow = 1 , common.legend = T )

ggarrange(plotlist = plotlist_celltype , ncol = 5, nrow = 1 , common.legend = T )

Visualise regions across patients

We can better interpret the region output by summarising the proportion of each cell type in a region across the patients. Looking at the composition of cell types in each region, comparing between responder and non-responders.

df <- data.frame(colData( data_sce))
 

df_plot <- NULL
for ( thispatient in unique(df$Sample_ID)){
  this_df <- df[df$Sample_ID == thispatient, ]
  temp_df <-   table( this_df$region, this_df$Cluster.v2 )
  temp_df <-  temp_df / rowSums(temp_df)
   temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$reponse <- unique( this_df$Response )
  df_plot <- rbind(df_plot, temp_df)
}

df_plot <- df_plot %>% dplyr::group_by( Var1 , Var2, reponse) %>% 
  summarise(mean_proportion = mean(Freq))
  
# r1 <- df_plot[ df_plot$Var1 == "region_1", ]  

ggplot(df_plot, aes(y = Var2, x = Var1 ,colour =mean_proportion  , size = mean_proportion ))+  geom_point() + 
  facet_grid(~reponse, scales = "free", space = "free" ) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  xlab("Region" ) + ylab("Celltype") + scale_colour_viridis_c()

Interactive Q&A:

Q4: Which regions appear to be different between responders and non-responders?

df <- data.frame(colData( data_sce))

df <- df %>% dplyr::group_by(Sample_ID ,Response, region) %>%
  count(Cluster.v2) %>%
  mutate(proportion = n / sum(n))


ggplot(df, aes(y = proportion, x = Sample_ID , fill = Cluster.v2))+ geom_col()+facet_grid(~region+Response, scales = "free", space = "free" ) + scale_fill_manual(values = c(color_codes))  +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Interactive Q&A:

Q5: Does your conclusion change after looking at a different plot?

3.3 Further exploration by visualising selected regions

We see that region 1 appears to suggest the non-responder patients have more melano. Region 3 appears to be the tumor microenvironment with lots of Th.ae (antigen-experienced) and macro.mono (macrophage and monocytes) cell types. Let’s focus on region 1 and region 3 and visualise the data with boxplots, as well as comparing to the overall cell type proportion without segmenting into regions.

df <- data.frame(colData( data_sce))

df_plot <- NULL
for ( thispatient in unique(df$Sample_ID)){
  this_df <- df[df$Sample_ID == thispatient, ]
  temp_df <-   table( this_df$region, this_df$Cluster.v2 )
  temp_df <-  temp_df / rowSums(temp_df)
  temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$reponse <- unique( this_df$Response )
  df_plot <- rbind(df_plot, temp_df)
}

df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
 
a <- ggplot(df_plot_region_1, aes(x =  Var2,  y = Freq, colour = reponse)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0,1)


df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]

b <- ggplot(df_plot_region_3, aes(x =  Var2,  y = Freq, colour = reponse)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3")
 


overall <- NULL
for ( thispatient in unique(df$Sample_ID)){
  
  this_df <- df[df$Sample_ID == thispatient, ]
  
  temp_df <-   table(  this_df$Cluster.v2 )
  temp_df <-  temp_df /sum(temp_df)
   temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$reponse <- unique( this_df$Response )
  overall <- rbind(overall, temp_df)
}


c <- ggplot(overall, aes(x =  Var1,  y = Freq, colour = reponse)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition")
 

a + b + c

Selecting a specific marker for further exploration

Often you may have a marker in mind to further examine the expression of key marker genes in these region specific cell types. For example, we select cells that have high Ki67 expression. (ie, only keeping the cells that have Ki67 expression higher than the median Ki67 expression in the whole dataset). We choose Ki67 as an example here because Ki67 is strongly associated with tumor cell proliferation and growth and is widely used as a biomarker in cancer analysis.

median_ki67 <- median( logcounts(data_sce[ "Ki67" , ]))
data_sce$ki67 <- ifelse( logcounts(data_sce[ "Ki67" , ]) > median_ki67, "high_ki67" , "low_ki67")


df_plot <- NULL
for ( thispatient in unique(df$Sample_ID)){
  this_df <- df[df$Sample_ID == thispatient, ]
  temp_df <-   table( this_df$region, this_df$Cluster.v2 )
  temp_df <-  temp_df / rowSums(temp_df)
  temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$reponse <- unique( this_df$Response )
  df_plot <- rbind(df_plot, temp_df)
}

df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
 
a <- ggplot(df_plot_region_1, aes(x =  Var2,  y = Freq, colour = reponse)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0,1)


df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]

b <- ggplot(df_plot_region_3, aes(x =  Var2,  y = Freq, colour = reponse)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3")
 


overall <- NULL
for ( thispatient in unique(df$Sample_ID)){
  
  this_df <- df[df$Sample_ID == thispatient, ]
  
  temp_df <-   table(  this_df$Cluster.v2 )
  temp_df <-  temp_df /sum(temp_df)
   temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$reponse <- unique( this_df$Response )
  overall <- rbind(overall, temp_df)
}


c <- ggplot(overall, aes(x =  Var1,  y = Freq, colour = reponse)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition")
 

a + b + c

Discussion:

Comparing the overall composition and the cell type composition in the region, what did you observe about the regions?

4 How do we generate a molecular representation for each individual?

In this demo, we use scFeatures to generate molecular representation for each patient from the matrix of proteins by cells. The molecular representation is interpretable and hence facilitates downstream analysis of the patient. Overall, scFeatures generates features across six categories representing different molecular views of cellular characteristics. These include:

  1. cell type proportions
  2. cell type specific gene expressions
  3. cell type specific pathway expressions
  4. cell type specific cell-cell interaction (CCI) scores
  5. overall aggregated gene expressions
  6. spatial metrics

The different types of features constructed enable a more comprehensive multi-view understanding of each patient from a matrix of proteins x cells.

Given in the previous section, we clustered the cells into regions, we can use the region information as an additional layer of information on top of the cell types to examine region-specific cell-type specific features.

region <- data_sce$region
region <- gsub("_" , "", region)
data_sce$celltype <- paste0( data_sce$Cluster.v2 , "-" , region)

print("number of cells in each sample")
## [1] "number of cells in each sample"
table(data_sce$Sample_ID) 
## 
## 01RD 02RD 04RD 05RD 06RD 08BL 09RD 10RD 12RD 13RD 14RD 16BL 19BL 21RD 22RD 23RD 
## 2539 2556 3448 4460 4176 1924 5605 4537 2796 4661 2105 2408 2626 5411 4742 2611 
## 24RD 25RD 26BL 29RD 31RD 32RD 33RD 34RD 35RD 37RD 39RD 40RD 41BL 42RD 
## 4059 5774 3968 2230 4522 5366 4061 2478 4325 4032 2391 3730 4848 4108
print("number of cells in each celltype - Region specific cell type")
## [1] "number of cells in each celltype - Region specific cell type"
table(data_sce$celltype) 
## 
##            B-region1            B-region2            B-region3 
##                   87                 2942                  649 
##            B-region4            B-region5         CD31-region1 
##                   68                  148                  164 
##         CD31-region2         CD31-region3         CD31-region4 
##                  932                  205                  371 
##         CD31-region5   macro.mono-region1   macro.mono-region2 
##                 1320                  752                 6395 
##   macro.mono-region3   macro.mono-region4   macro.mono-region5 
##                  668                 1385                 3999 
##       melano-region1       melano-region2       melano-region3 
##                 6754                37987                  493 
##       melano-region4       melano-region5        other-region1 
##                 2549                 3232                 1278 
##        other-region2        other-region3        other-region4 
##                14848                  338                 7605 
##        other-region5        Tc.ae-region1        Tc.ae-region2 
##                 2954                  295                 1486 
##        Tc.ae-region3        Tc.ae-region4        Tc.ae-region5 
##                  766                  347                 1694 
##     Tc.naive-region1     Tc.naive-region2     Tc.naive-region3 
##                  101                  190                  146 
##     Tc.naive-region4     Tc.naive-region5        Th.ae-region1 
##                  149                  475                  165 
##        Th.ae-region2        Th.ae-region3        Th.ae-region4 
##                  659                 1890                  148 
##        Th.ae-region5     Th.naive-region1     Th.naive-region2 
##                  651                   42                   98 
##     Th.naive-region3     Th.naive-region4     Th.naive-region5 
##                  622                   45                  220 
##         Treg-region1         Treg-region2         Treg-region3 
##                   50                  195                  427 
##         Treg-region4         Treg-region5 unclassified-region1 
##                   48                  236                  187 
## unclassified-region2 unclassified-region3 unclassified-region4 
##                 1938                  187                  465 
## unclassified-region5 
##                  452

Discussion:

Are there any samples or cell types you would like to remove from the data?

How to create molecular representations of patients

All the feature types can be generated in one line of code. This runs the function using default settings for all parameters. For more information, type ?scFeatures.

# scFeatures requires the following columns 
# celltype, sample, x_cord and y_cord
# alternatively, these can be also set as argument in the scFeatures function 
 
data_sce$sample <- data_sce$Sample_ID
data_sce$x_cord <- data_sce$Location_Center_X
data_sce$y_cord <- data_sce$Location_Center_Y

# here, we specify that this is a spatial proteomics data
# scFeatures support parallel computation to speed up the process 
scfeatures_result <- scFeatures(data_sce , type = "spatial_p" , ncores = 10 )

4.1 How to visualise and explore scFeatures output?

We have generated a total of 13 feature types and stored them in a list. All generated feature types are stored in a matrix of samples by features. For example, the first list element contains the feature type proportion_raw, which contains the cell type proportion features for each patient sample. We could print out the first 5 columns and first 5 rows of the first element to see.

scfeatures_result <- readRDS("scfeatures_result_region_specific.rds")

# combine sample name with outcome 
scfeatures_result_format <- scfeatures_result
outcome <- data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Response
outcome <- unname(outcome) 

for ( i in c(1:length(scfeatures_result_format))){
  this <- scfeatures_result_format[[i]]
  rownames(this) <- paste0(rownames(this), "_cond_", outcome )
  scfeatures_result_format[[i]] <- this
}

# we have generated a total of 13 feature types
names(scfeatures_result_format)
##  [1] "proportion_raw"       "proportion_logit"     "proportion_ratio"    
##  [4] "gene_mean_celltype"   "gene_prop_celltype"   "gene_cor_celltype"   
##  [7] "gene_mean_bulk"       "gene_prop_bulk"       "gene_cor_bulk"       
## [10] "L_stats"              "celltype_interaction" "morans_I"            
## [13] "nn_correlation"
# each row is a sample, each column is a feature 
data.frame(scfeatures_result_format[[1]][1:5, 1:5])
##                B.region1   B.region2  B.region3    B.region4    B.region5
## 26BL_cond_No 0.000000000 0.000000000 0.00000000 0.0000000000 0.0005040323
## 19BL_cond_No 0.000000000 0.000000000 0.00000000 0.0000000000 0.0000000000
## 08BL_cond_No 0.000000000 0.001039501 0.00000000 0.0005197505 0.0015592516
## 41BL_cond_No 0.000000000 0.000000000 0.00000000 0.0000000000 0.0000000000
## 31RD_cond_No 0.008624502 0.030075188 0.03803627 0.0006634233 0.0055285272
## DT::datatable(meta_table , options = list(scrollX = TRUE))

Once the features are generated, you may wish to visually explore the features.

Here, we plot a volcano plot and a dotplot for the region specific cell type specific expression feature.

gene_mean_celltype <- scfeatures_result_format$gene_mean_celltype
# this transposes the data
# in bioinformatics conversion, features are stored in rows 
# in statistics convention, features are stored in columns
gene_mean_celltype <- t(gene_mean_celltype)
      
# pick CD31-region5 as an example cell type 
gene_mean_celltype <- gene_mean_celltype[ grep("B-region4", rownames(gene_mean_celltype)), ]
condition  <- unlist( lapply( strsplit( colnames(gene_mean_celltype), "_cond_"), `[`, 2))
condition <- data.frame(condition = condition )
design <- model.matrix(~condition, data = condition)
fit <- lmFit(gene_mean_celltype, design)
fit <- eBayes(fit)
tT <- topTable(fit, n = Inf)
tT$gene <- rownames(tT)
p <- ggplot( tT , aes(logFC,-log10(P.Value) , text = gene ) )+
      geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
      scale_colour_gradient(low="blue",high="red")+
      xlab("log2 fold change") + ylab("-log10 p-value") + theme_minimal()
 
a <- ggplotly(p) 

a
b <- ggplot( tT , aes( y = reorder(gene, logFC) , x = logFC  ) )+
      geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
      scale_colour_gradient(low="blue",high="red")+
      xlab("logFC") + ylab("region specific cel type specfic features" ) + theme_minimal()

b

Interactive Q&A:

Q6: Which figure do you prefer? The volcano plot or the dotplot?

To accommodate for easier interpretation of the features, scFeatures contains a function run_association_study_report that enables the user to readily visualise and explore all generated features with one line of code.

Tips: Some categories of features such as cell cell interaction takes a long time to run. You may use parallel computation to speed up the process or select a small number of feature categories to reduce computational time.

# specify a folder to store the html report. Here we store it in the current working directory. 
output_folder <-  getwd()
run_association_study_report(scfeatures_result, output_folder )

4.2 Are the generated features sensible?

Interactive Q&A:

Using the HTML, we can look at some of the critical thinking questions that a researcher would ask about the generated features. These questions are exploratory and there is no right or wrong answer.

Q7: Do the generated features look reasonable?

  • Which cell type(s) would you like to focus on at your next stage of analysis?
  • Which feature type(s) would you like to focus on at your next stage of analysis?
  • Are the conditions in your data relatively easy or difficult to distinguish?

5 Can we classify or discrimiante between responders and non-responders?

Building a classification model

In this section, we will perform disease outcome classification using the molecular representation of patients. Recall in the previous section that we have stored the 13 feature types matrix in a list. Instead of manually retrieving each matrix from the list to build separate models, classifyR can directly take a list of matrices as an input and run a repeated cross-validation model on each matrix individually. Below, we run 5 repeats of 3-fold cross-validation.

outcome <- data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Response
outcome <- unname(outcome) 


### generate classfyr result 

classifyr_result <- crossValidate(scfeatures_result,
                                 outcome, 
                                 classifier = "kNN",
                                 nFolds = 3, 
                                 nRepeats = 5, 
                                 nCores = 20  )

Visualising the classification performance

To examine the classification model performance, we first need to specify a metric to calculate. Here, we calculate the balanced accuracy and visualise the accuracy using boxplots.

classifyr_result <-  readRDS("classifyr_result_region_specific.rds")
classifyr_result <- lapply(classifyr_result, 
                           function(x) calcCVperformance(x, performanceType = "Balanced Accuracy"))

level_order <- names(scfeatures_result)
p  <- performancePlot(classifyr_result) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  scale_x_discrete(limits = level_order)  
p

Interactive Q&A:

Q8: Based on the classification performance and survival (see Appendix below), which feature type would you like to focus on at your next stage of analysis?

6 Appendix

6.1 PART A: Survival analysis

The dataset has a survival outcome. Apart from performing prediction on responder versus non-responder, here we highlight the use of scFeatures for survival analysis.

survival_day <-  unname( data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Survival_from_Rx_Start)
censoring <-  unname( data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Known_Deceased)

 i <- 13
 
plotlist <- list()
for (i in c(1 : length( scfeatures_result ))){
  feature_name <- names(scfeatures_result)[i]
  feature <-  scfeatures_result[[i]]
  feature <- t(feature)
  # run hierarchical clustering
  hclust_res <- hclust(
    as.dist(1 - cor(feature )),
    method = "ward.D2"
  )
  
  cluster_res <- cutree(hclust_res, k = 2)
  
  metadata <- data.frame( cluster = factor(cluster_res),
                          survival_day = survival_day,
                          censoring = censoring)
  
  # plot survival curve
  fit <- survfit(
    Surv(survival_day, censoring) ~ cluster,
    data = metadata
  )
  ggsurv <- ggsurvplot(fit,
                       conf.int = FALSE, risk.table = TRUE,
                       risk.table.col = "strata", pval = TRUE,
                       xlim = c(0,700), break.time.by = 100 
                        
  ) + ggtitle(  feature_name)
  
  plotlist[[feature_name]] <-   ggsurv 
 
}

 arrange_ggsurvplots(  plotlist, print = TRUE,
  ncol = 3 , risk.table.height = 0.3)

6.2 PART B: Explanation of spatial features

  • L function:

The L function is a spatial statistic used to assess the spatial distribution of cell types. It assesses the significance of cell-cell interactions, by calculating the density of a cell type with other cell types within a certain radius. High values indicate spatial association, low values indicate spatial avoidance.

tableau_palette <- scale_colour_tableau() 
color_codes <- tableau_palette$palette( 10 )
color_codes <- c(color_codes, "darkgrey" , "grey90") 

names(color_codes) <- c( unique(data_sce$Cluster.v2) ,  "other regions")
 
one_sample  <- data_sce[ , data_sce$Sample_ID == "16BL"  ]
one_sample <- data.frame( colData(one_sample) )

one_sample$celltype <- one_sample$Cluster.v2
index <-  one_sample$celltype  %in% c("macro.mono", "Tc.ae")
one_sample$celltype[!index] <- "others"
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point()  + scale_colour_manual(values = color_codes)  + ggtitle( "Patient 16BL - high L value with \n macro.mono interacting Tc.ae")
 

one_sample$celltype <- one_sample$Cluster.v2
index <-  one_sample$celltype  %in% c("melano", "Tc.ae")
one_sample$celltype[!index] <- "others"
b <- ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point()  + scale_colour_manual(values = color_codes)  + ggtitle( "Patient 16BL - low L value with  \n melano interacting Tc.ae")
 
a + b

  • Cell type interaction composition:

We calculate the nearest neighbours of each cell and then calculate the pairs of cell types based on the nearest neighbour. This allows us to summarise it into a cell type interaction composition.

tableau_palette <- scale_colour_tableau() 
color_codes <- tableau_palette$palette( 10 )
color_codes <- c(color_codes, "darkgrey" , "grey90") 

names(color_codes) <- c( unique(data_sce$Cluster.v2) ,  "other regions")
 
one_sample  <- data_sce[ , data_sce$Sample_ID == "16BL"  ]
one_sample <- data.frame( colData(one_sample) )

one_sample$celltype <- one_sample$Cluster.v2
 
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point()  + scale_colour_manual(values = color_codes)  + ggtitle( "Patient 16BL")


feature  <- scfeatures_result$celltype_interaction
to_plot <- data.frame( t( feature["16BL", ])  )
to_plot$feature <- rownames(to_plot) 
colnames(to_plot)[2] <- "celltype interaction composition"
 
to_plot <- to_plot[ to_plot$X16BL > 0.03 , ] 
b <- ggplot(to_plot, aes( x =  reorder(`celltype interaction composition`, X16BL) ,  y = X16BL, fill=`celltype interaction composition`)) + geom_bar(stat="identity" ) + ylab("Major cell type interactions")  +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) 

a+ b

  • Moran’s I:

Moran’s I is a spatial autocorrelation statistic based on both location and values. It quantifies whether similar values tend to occur near each other or are dispersed.

high  <- data_sce["Ki67", data_sce$Sample_ID == "25RD"  ]
high_meta <- data.frame( colData(high) ) 
high_meta$expression <- as.vector(logcounts( high)) 

low  <- data_sce["Ki67",  data_sce$Sample_ID == "42RD" ]
low_meta <- data.frame( colData(low) )
low_meta$expression <- as.vector(logcounts(low))


a <- ggplot(high_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient 25RD - high Moran's I in Ki67")

b <- ggplot(low_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient 42RD - low Moran's I in Ki67")

a+b

  • Nearest Neighbor Correlation:

This metric measures the correlation of proteins/genes between a cell and its nearest neighbour cell.

We see from both prediction and survival analysis that the feature type “nn correlation” (nearest neighbour correlation) performs the best.

Here we pick the protein “S100”, a key player in cancer, as an example to illustrate the concept.

 plot_nncorrelation <- function(thissample , thisprotein){
   
       sample_name <- thissample
       thissample <- data_sce[, data_sce$Sample_ID ==     sample_name]
    
      
      exprsMat <- logcounts(thissample)
     
    
    cell_points_cts <- spatstat.geom::ppp(
            x = as.numeric(thissample$Location_Center_X ), y = as.numeric(thissample$Location_Center_Y),
            check = FALSE,
            xrange = c(
                min(as.numeric(thissample$Location_Center_X)),
                max(as.numeric(thissample$Location_Center_X))
            ),
            yrange = c(
                min(as.numeric(thissample$Location_Center_Y)),
                max(as.numeric(thissample$Location_Center_Y))
            ),
            marks = t(as.matrix(exprsMat))
        )
    
     value <-  spatstat.explore::nncorr(cell_points_cts)["correlation", ]
      value <-  value[  thisprotein]
     
    # Find the indices of the two nearest neighbors for each cell
    nn_indices <- nnwhich(cell_points_cts, k = 1)
    
    protein <-  thisprotein
    df <- data.frame(thiscell_exprs  = exprsMat[protein, ] , exprs =  exprsMat[protein,nn_indices ])
    
   p <-  ggplot(df, aes( x =thiscell_exprs ,  y = exprs , colour =  exprs  )) +
      geom_point(alpha = 0.3) + ggtitle(paste0( "Patient ", sample_name ,  " nn_corr = " ,  round(value, 2)  )) + scale_colour_viridis_c() 
   
   return (p ) 

}

    
p1 <- plot_nncorrelation( "42RD" ,  "S100" )
p2 <- plot_nncorrelation( "29RD" ,  "S100" )
p1  + p2   

The correlation differs between the 42RD patient (from cluster 1) and the 29RD patient (from cluster 2).

6.3 PART C: SessionInfo

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so;  LAPACK version 3.11.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Australia/Sydney
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scran_1.28.1                scater_1.28.0              
##  [3] scuttle_1.10.1              lisaClust_1.8.0            
##  [5] spatstat_3.0-6              spatstat.linnet_3.1-1      
##  [7] spatstat.model_3.2-4        rpart_4.1.19               
##  [9] spatstat.explore_3.2-1      nlme_3.1-162               
## [11] spatstat.random_3.1-5       spatstat.geom_3.2-1        
## [13] spatstat.data_3.0-1         survminer_0.4.9            
## [15] ggpubr_0.6.0                tidyr_1.3.0                
## [17] scattermore_0.8             plotly_4.10.2              
## [19] limma_3.56.2                dplyr_1.1.2                
## [21] spicyR_1.12.0               ggthemes_4.2.4             
## [23] ClassifyR_3.4.8             survival_3.5-5             
## [25] BiocParallel_1.34.2         MultiAssayExperiment_1.26.0
## [27] generics_0.1.3              scFeatures_0.99.27         
## [29] ggplot2_3.4.2               SingleCellExperiment_1.22.0
## [31] SummarizedExperiment_1.30.2 Biobase_2.60.0             
## [33] GenomicRanges_1.52.0        GenomeInfoDb_1.36.1        
## [35] IRanges_2.34.1              S4Vectors_0.38.1           
## [37] BiocGenerics_0.46.0         MatrixGenerics_1.12.2      
## [39] matrixStats_1.0.0          
## 
## loaded via a namespace (and not attached):
##   [1] SpatialExperiment_1.10.0   R.methodsS3_1.8.2         
##   [3] GSEABase_1.62.0            progress_1.2.2            
##   [5] tiff_0.1-11                EnsDb.Mmusculus.v79_2.99.0
##   [7] goftest_1.2-3              Biostrings_2.68.1         
##   [9] HDF5Array_1.28.1           vctrs_0.6.1               
##  [11] digest_0.6.32              png_0.1-8                 
##  [13] shape_1.4.6                EnsDb.Hsapiens.v79_2.99.0 
##  [15] ggrepel_0.9.3              deldir_1.0-9              
##  [17] parallelly_1.36.0          magick_2.7.4              
##  [19] MASS_7.3-59                reshape2_1.4.4            
##  [21] httpuv_1.6.11              foreach_1.5.2             
##  [23] withr_2.5.0                xfun_0.39                 
##  [25] ellipsis_0.3.2             commonmark_1.9.0          
##  [27] cytomapper_1.12.0          memoise_2.0.1             
##  [29] proxyC_0.3.3               ggbeeswarm_0.7.2          
##  [31] systemfonts_1.0.4          zoo_1.8-12                
##  [33] GlobalOptions_0.1.2        gtools_3.9.4              
##  [35] SingleCellSignalR_1.12.0   pbapply_1.7-2             
##  [37] R.oo_1.25.0                prettyunits_1.1.1         
##  [39] KEGGREST_1.40.0            promises_1.2.0.1          
##  [41] httr_1.4.6                 rstatix_0.7.2             
##  [43] restfulr_0.0.15            globals_0.16.2            
##  [45] fitdistrplus_1.1-11        rhdf5filters_1.12.1       
##  [47] rhdf5_2.44.0               rstudioapi_0.14           
##  [49] miniUI_0.1.1.1             concaveman_1.1.0          
##  [51] babelgene_22.9             curl_5.0.1                
##  [53] zlibbioc_1.46.0            ScaledMatrix_1.8.1        
##  [55] polyclip_1.10-4            GenomeInfoDbData_1.2.10   
##  [57] fftwtools_0.9-11           xtable_1.8-4              
##  [59] stringr_1.5.0              evaluate_0.21             
##  [61] S4Arrays_1.0.4             BiocFileCache_2.8.0       
##  [63] hms_1.1.3                  irlba_2.3.5.1             
##  [65] colorspace_2.1-0           filelock_1.0.2            
##  [67] ROCR_1.0-11                reticulate_1.30           
##  [69] magrittr_2.0.3             lmtest_0.9-40             
##  [71] viridis_0.6.3              later_1.3.1               
##  [73] lattice_0.21-8             future.apply_1.11.0       
##  [75] XML_3.99-0.14              cowplot_1.1.1             
##  [77] RcppAnnoy_0.0.21           svgPanZoom_0.3.4          
##  [79] class_7.3-22               pillar_1.9.0              
##  [81] simpleSeg_1.2.0            EBImage_4.42.0            
##  [83] iterators_1.0.14           caTools_1.18.2            
##  [85] compiler_4.3.1             beachmat_2.16.0           
##  [87] stringi_1.7.12             tensor_1.5                
##  [89] minqa_1.2.5                GenomicAlignments_1.36.0  
##  [91] plyr_1.8.8                 msigdbr_7.5.1             
##  [93] crayon_1.5.2               abind_1.4-5               
##  [95] BiocIO_1.10.0              ggtext_0.1.2              
##  [97] locfit_1.5-9.8             sp_2.0-0                  
##  [99] terra_1.7-39               bit_4.0.5                 
## [101] codetools_0.2-19           BiocSingular_1.16.0       
## [103] crosstalk_1.2.0            bslib_0.5.0               
## [105] multtest_2.56.0            mime_0.12                 
## [107] splines_4.3.1              markdown_1.7              
## [109] circlize_0.4.15            Rcpp_1.0.10               
## [111] dbplyr_2.3.2               sparseMatrixStats_1.12.2  
## [113] gridtext_0.1.5             knitr_1.43                
## [115] blob_1.2.4                 utf8_1.2.3                
## [117] AnnotationFilter_1.24.0    lme4_1.1-34               
## [119] nnls_1.4                   listenv_0.9.0             
## [121] DelayedMatrixStats_1.22.1  GSVA_1.48.2               
## [123] ggsignif_0.6.4             tibble_3.2.1              
## [125] Matrix_1.5-4.1             scam_1.2-14               
## [127] statmod_1.5.0              svglite_2.1.1             
## [129] tweenr_2.0.2               pkgconfig_2.0.3           
## [131] pheatmap_1.0.12            tools_4.3.1               
## [133] cachem_1.0.8               RSQLite_2.3.1             
## [135] viridisLite_0.4.2          DBI_1.1.3                 
## [137] numDeriv_2016.8-1.1        fastmap_1.1.1             
## [139] rmarkdown_2.23             scales_1.2.1              
## [141] grid_4.3.1                 ica_1.0-3                 
## [143] Seurat_4.3.0.1             shinydashboard_0.7.2      
## [145] Rsamtools_2.16.0           broom_1.0.5               
## [147] sass_0.4.6                 patchwork_1.1.2           
## [149] BiocManager_1.30.21        graph_1.78.0              
## [151] carData_3.0-5              RANN_2.6.1                
## [153] farver_2.1.1               mgcv_1.8-42               
## [155] yaml_2.3.7                 rtracklayer_1.60.0        
## [157] cli_3.6.1                  purrr_1.0.1               
## [159] leiden_0.4.3               lifecycle_1.0.3           
## [161] uwot_0.1.16                bluster_1.10.0            
## [163] backports_1.4.1            DropletUtils_1.20.0       
## [165] annotate_1.78.0            gtable_0.3.3              
## [167] rjson_0.2.21               ggridges_0.5.4            
## [169] progressr_0.13.0           parallel_4.3.1            
## [171] ape_5.7-1                  jsonlite_1.8.7            
## [173] edgeR_3.42.4               bitops_1.0-7              
## [175] bit64_4.0.5                Rtsne_0.16                
## [177] spatstat.utils_3.0-3       BiocNeighbors_1.18.0      
## [179] SeuratObject_4.1.3         RcppParallel_5.1.7        
## [181] highr_0.10                 jquerylib_0.1.4           
## [183] metapod_1.8.0              dqrng_0.3.0               
## [185] survMisc_0.5.6             R.utils_2.12.2            
## [187] lazyeval_0.2.2             shiny_1.7.4               
## [189] htmltools_0.5.5            KMsurv_0.1-5              
## [191] sctransform_0.3.5          rappdirs_0.3.3            
## [193] ensembldb_2.24.0           glue_1.6.2                
## [195] XVector_0.40.0             RCurl_1.98-1.12           
## [197] jpeg_0.1-10                gridExtra_2.3             
## [199] AUCell_1.22.0              boot_1.3-28.1             
## [201] igraph_1.5.0               R6_2.5.1                  
## [203] gplots_3.1.3               labeling_0.4.2            
## [205] km.ci_0.5-6                GenomicFeatures_1.52.1    
## [207] cluster_2.1.4              Rhdf5lib_1.22.0           
## [209] nloptr_2.0.3               vipor_0.4.5               
## [211] DelayedArray_0.26.6        tidyselect_1.2.0          
## [213] ProtGenerics_1.32.0        raster_3.6-23             
## [215] ggforce_0.4.1              xml2_1.3.4                
## [217] car_3.1-2                  AnnotationDbi_1.62.2      
## [219] future_1.33.0              rsvd_1.0.5                
## [221] munsell_0.5.0              KernSmooth_2.23-21        
## [223] data.table_1.14.8          htmlwidgets_1.6.2         
## [225] RColorBrewer_1.1-3         biomaRt_2.56.1            
## [227] rlang_1.1.1                spatstat.sparse_3.0-2     
## [229] lmerTest_3.1-3             fansi_1.0.4               
## [231] beeswarm_0.4.0

6.4 Acknowledgment

The authors thank all their colleagues, particularly at The University of Sydney, Sydney Precision Data Science and Charles Perkins Centre for their support and intellectual engagement. Special thanks to Ellis Patrick, Shila Ghazanfar, Andy Tran for guiding and supporting the building of this workshop.